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Abstract 

Significant pattern mining, the problem of finding itemsets that are significantly 
enriched in one class of objects, is statistically challenging, as the large space of 
candidate patterns leads to an enormous multiple testing problem. Recently, the 
concept of testability was proposed as one approach to correct for multiple testing 
in pattern mining while retaining statistical power. Still, these strategies based on 
testability do not allow one to condition the test of significance on the observed 
covariates, which severely limits its utility in biomedical applications. Here we 
propose a strategy and an efficient algorithm to perform significant pattern mining 
in the presence of categorical covariates with K states. 


1 Introduction 

The goal of frequent pattern mining HI is to find patterns in a dataset of objects that occur in at least 
6 percent of all objects. Its most popular instance infrequent itemset mining, in which one is given a 
collection of sets (transactions) and tries to find subsets of elements (items) that frequently co-occur 
in the same sets. The classic example is shopping basket analysis, in which one tries to find groups 
of products that are frequently co-bought by the customers of a supermarket El. 

Significant (or discriminative or contrast) pattern mining a does not focus on the absolute fre¬ 
quency of a pattern in a dataset, but rather whether it is statistically significantly enriched in one of 
two classes of objects. Einding groups of mutations that are statistically significantly more common 
in disease-carriers than healthy controls is one of the many biomedical applications of significant 
pattern mining ||6| . 

One of the biggest problems in pattern mining is the fact that the number of candidate patterns 
grows exponentially with the number of items in a pattern. The resulting computational problem of 
enumerating frequent patterns has been intensively investigated for decades m. The accompanying 
statistical problem, that each candidate pattern consists of a hypothesis in significant pattern mining 
and that testing millions and billions of patterns for significance leads to an enormous multiple 
testing problem, has received far less attention. 

Only recently, Terada et al. Qa introduced the concept of testability, originally developed by 
Tarone lfT3]l . to pattern mining; the key idea is that if we are working with discrete test statistics, e.g. 
Eisher’s exact test, to quantify whether a pattern is statistically significantly enriched in one class, 
then many patterns are too rare or too frequent to ever become significantly enriched. Tarone argued 
that these untestable hypotheses (or untestable patterns) need not be considered when correcting for 
multiple testing (e.g. via Bonferroni correction ||4l), thereby greatly improving the statistical power 
in detecting truly significant patterns. 
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While this new approach is of great importance for high-dimensional data mining, it is limited in 
one crucial aspect: It cannot account for the effect of covariates, which are omnipresent in particular 
biomedical applications. For instance, the enrichment of particular mutations in disease-carriers may 
heavily depend on a person’s age group, gender, or the geographic subpopulation to which he or she 
belongs. It has been long appreciated by the Genetics community CD that ignoring these covariates 
leads to a surge in false positive patterns. Hence, significant pattern mining methods which do not 
account for these covariates have limited application in this domain. 

In order to bridge this gap, our goal in this article is to enable significant pattern mining in data 
that is stratified according to a categorical variable with K states. 

To this end, we propose a strategy based on the Cochran-Mantel-Haenszel test for K 2 x 2 contin¬ 
gency tables 0 and increase its statistical power by pruning non-testable patterns (Section]^. As 
this approach scales exponentially with K we also propose an exact algorithm to perform the test in 
a runtime which is only of order 0{K log K) (Section]^. Experiments on simulated and real data 
demonstrate that our approach is superior in terms of runtime, statistical power and false positive 
rate when compared to all competing methods (Section]^. We believe that our approach opens the 
door to numerous applications of significant pattern mining. 

2 Significant pattern mining and the multiple testing problem 

2.1 Significant pattern mining 

Let S = { ei,..., Cm } be a universe of m items. We call any set 5 C f containing |5| distinct 
items a pattern and assume we are given a database V = { {%, yi) containing n transactions 
Ti ^ £ with binary labels pi G { 0,1}. Informally, the goal of significant pattern mining is to find 
all patterns S whose occurrence within a transaction T is informative of the label y of T. 

Given a pattern S, let G{S,T) = 1[5 C T] be an indicator binary random variable which takes 
value 1 whenever 5 C T and value 0 otherwise. The pattern S is statistically associated with the 
class labels y if G{S,T) and the class labels y are statistically dependent random variables. Given 
a transaction database T) containing n i.i.d. labeled transactions, we can estimate the statistical 
dependence between G{S, T) and the class labels y by choosing an appropriate test statistic and 
computing the corresponding p-value. The p-value of the observed association between G{S,T) 
and the class labels y in I? is defined as the probability of obtaining a value for the test statistic 
at least as extreme as the one observed in V under the null hypothesis of statistical independence 
between G(5, T) and y. A low p-value indicates that it is unlikely that the database V was generated 
by a model in which the random variables G{S, T) and y are statistically independent. Thus, the 
pattern S and the class labels y are deemed to be statistically significantly associated if the p-value 
is below a predefined significance threshold a. From the definition of p-value it follows that the 
probability of finding an inexistent association between pattern S and the class labels y, i.e. having 
a false positive, is upper bounded by a. Therefore, the significance threshold controls the trade¬ 
off between the probability of false positives and the statistical power to discover truly associated 
patterns. 

Since both G{S, T) and y are binary random variables, one can summarise the relevant information 
from the observed database I? as a 2 x 2 contingency table: 


Variables 

G(6’,T) = 1 

G(S,T)=0 

Row totals 

y = 1 

as 

ni - as 

ni 

O 

II 

xs - as 

{n - ni) - (xs - as) 

n — Til 

Col totals 

xs 

n- Xs 

n 


n is the number of transactions, ni of which have class label y = V,xs is the number of transactions 
containing pattern S and as is the number of transactions with class label y = 1 containing pattern 
S. Popular methods to obtain p-values from 2x2 contingency tables include Fisher’s exact test ||71 
and Pearson’s x^-test Qa, among others. 

In large-scale significant pattern mining, we must compute a p-value as described above for every 
pattern 5 in a potentially huge search space of patterns, i.e. for all S G A4, A4 C 2^. Note that 
this formulation is very general and includes classical significant pattern mining problems such as 
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significant itemset mining, significant subgraph mining or significant interval search as particular 
instances. As we will discuss in section p3] the overwhelming size of the search space poses both 
computational and statistical challenges that can only be overcome by combining special statistical 
testing methods for discrete data with efficient branch-and-bound algorithms. 

2.2 Significant pattern mining in stratified datasets 

Suppose we have a collection { V ^ transaction databases defined under the same universe 

of items 8. This situation arises naturally when the data is influenced by an observed categorical 
covariate of cardinality K. In this setting, one can perform a meta-analysis and construct a separate 
2x2 contingency table for each of the K values of the categorical covariate to then combine the 
individual results. 

One of the first meta-analysis procedures is Fisher’s combined probability test il, which computes 
a p-value for each of the K 2 x 2 contingency tables independently and then combines them as 
Ffisher = — 2 F*- shown that under the global null hypothesis that there is no statistical 

association in any of the K contingency tables, pusher xtk- Another popular approach is the 
Cochran-Mantel-Hanszel (CMH) test ||3. The CMH test statistic is defined directly as a function of 
the cell counts of each of the K contingency tables as; 


Tcruh 


{Ef=iah-Y^sy 

Etir(i-r)4 (i- 


( 1 ) 


where 7 * = n\jri^. The CMH statistic defined in equation Q can be seen as an extension of 
Pearson’s x^-test to meta-analysis, as both are identical for K = 1. Moreover, it can be shown |]9l 
that under the global null hypothesis Tcmh also follows a Xi distribution for any K. Unlike other 
meta-analysis methods, such as Fisher’s combined probability test, the CMH statistic combines the 
statistical association found across databases taking the sign of the correlation into account as well. 
In this article, we will choose the CMH statistic as the meta-analysis procedure but our work can be 
readily extended to other methods just as Fisher’s combined probability test. 


2.3 Multiple testing problem in pattern mining 


The main statistical challenge in large-scale sign ificant pattern mining is the so called multiple com¬ 


parisons problem. As discussed in section 2.1 the search space of patterns M. C 2^ is usually 


extremely large, with Ai in the order of billions or even trillions. Thus, if the occurrence within 
transactions of every pattern S G Ai was tested for association with the class labels at level a, the 
overall expected number of false positives would be approximately a|Af |. For classical choices of 
the significance threshold, such as a = 0.05, one would obtain from hundreds of millions to billions 
of false positives, rendering the procedure not applicable in practice. Therefore, rather than control¬ 
ling the probability of having false positives for every pattern S individually, it would be desirable to 
instead control the FWER (Family Wise Error Rate), defined as the probability of producing one or 
more false positives when exploring the whole search space A4. That can be achieved by computing 
a corrected significance threshold 6 such that if every 5 S Ad is deemed statistically significantly 
associated when pg < 6, one has FWER < a. The most popular way to correct for multiple 
testing is to use a Bonferroni correction |4l- By upper bounding FWER((5) < SAi it follows that 
JJon = OijM. controls the EWER. This classic procedure is known to be very conservative and to 
largely lose statistical power if \Ai \ is huge as in pattern mining. 


Tarone was first to point out in lfT3]l that the discrete nature of 2 x 2 contingency tables can be used 
to compute a lower bound on the attainable p-values and obtaine an improved corrected significance 
threshold. Eor a given 2x2 contingency table with fixed margins x^, rii and n one can show that 
tks G with — mcix(0, (ti tT-i)) and os.max — min(;r. 5 , rzi). Thus, 

there are at most as,max — as,min + 1 different values that the p-value could take. The minimum 
attainable p-value is then = min{p 5 (fc) | as,min < 7 < as,max }■ By definition, only the 

patterns belonging to the set of testable patterns at corrected significance level 5, Tt{5) = {5 | 
d'(x 5 ) <5}, can be statistically significantly associated at level 5. If the test statistic of choice 
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considers the table margins fixec0then patterns S G M\It{S) do not affect the FWER at level 6. 
Thus, one can obtain an improved corrected significance threshold as: 

<5tar = max{(51 S \Tt[5)\ < a} (2) 

How to compute this number of testable patterns for improved multiple testing correction has been 
the focus of several recent studies uni El [H. These approaches exploit a link the between the 
frequency of a pattern and its testability. This connection does not hold in meta-analysis (see Sec- 
tion |3.1| Paragraph ‘Search Space Pruning’), which therefore requires the development of new effi¬ 
cient techniques to detect all testable patterns in meta-analyses. In the next section, we propose such 
a novel efficient algorithm, FastCMH, which can find patterns S associated with the class labels 
y conditioned on a categorical covariate by using the CMH test, while only considering testable 
patterns in its correction for multiple testing. 


3 Efficient exact computation of Tcmh: FastCMH 


In this section we our main contribution, the FastCMH algorithm, in detail. In a nutshell, FastCMH 
is an efficient algorithm to solve equation when the CMH statistic is used to perform meta¬ 
analysis across K transaction databases. Since equation 0 essentially poses a one-dimensional 
optimisation problem, the solution can be found via line search. However, the main difficulty 
stems from the fact that evaluating 5 \It{5)\ exactly is extremely costly and, therefore, computa¬ 
tional tricks are required to tackle the optimisation problem. We first describe the pseudocode of 
FastCMH in subsection 3.1 and then its theoretical properties. 


3.1 The Algorithm: FastCMH 

The pseudocode of FastCMH is shown in Algorithmic 


Algorithm 1 FastCMH 


1: Input: and a 

2: Output: Corrected significance threshold <5* 

3: function FastCMHfa, { V }^j) 

4: j ^ 0, (5 ^ 10“^'" 

5: buckets[j] t— 0 Vj = 0,..., Aateps — 1 

6: \It{S)\ ^ 0 

7: process_next(root, { n) 

8: Return <5*^^ = oi/ |Tt(i5)| 

9: end function 

10 : function process_next(5, { }i=i) 

11 : if tkd a ;5 }^j) < 5 then 

12 : |Xt(5)| ^ \Tt{S)\ + 1 

13: update-bucketltl/d 0:5 

14: while 5 \Tt{5)\ > a do 

15: \It{&)\ \Tt{5)\ — buckets[ji] 

16: j ^ j + 10 “^'" 

17: end while 

18: end if 

19: if is_not_prunable({ Xq then 


20: for 5' € Children(5) do 

21: Compute x''gi Vi = 1,..., A 

22: process_next(5\2;5/) 

23: end for 

24: end if 

25: end function 

26: function update_bucket(p) 

27: i ^ L-logio(p)/pJ 

28: buckets[i] t— buckets[i] -|- 1 

29: end function 

30: function is_not_prunable({ x's 

31: it3i \ x's > min(n),n* — n\) then 

32: Return True 

33: end if 

34: Tprune^ Hiax 

{0<zi<x\; 

35: prune ^ 1 77^2 (Tprune) 

36: Return 4/prune < S 

37: end function 


Initialization: Between lines |C and the main variables of CMH are initialized. We try to find 
by sampling A’steps tentative values for in a logarithmic grid with step size p in 

the exponent. Due to the discrete nature of the problem, as long as y, is small enough and A^teps 
large enough the method is insensitive to those parameters. We used y = 0.06 and Agteps = 500 

'This holds for most popular test statistics for 2x2 contingency tables, such as Fisher’s Exact Test or 
Pearson’s x'^-test. 
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throughout the article. We initialise the tentative solution 6 to 1, the largest value in the grid. Also, 
we create an array buckets, initialised to 0, such that buckets[i] equals the number of patterns 
S found so far with minimum attainable p-value 10" -(*+i)m < J < 10"*'". We also 

initialise the count of patterns S explored so far with 5'({ Xg < S, |It(i 5)|, to 0. After the 
initialisation, in line|^the enumeration process beings. 


FastCMH core: The process_next function 


Testability: We assume that the search space Ai can be arranged in the shape of a tree, with patterns 
S G Ai corresponding to nodes in such a way that children S' of a pattern . This assumption covers 
a vast amount of classical data mining problems such as itemset and subgraph mining. The function 
process_next is the core of CMH and is devoted to analysing a given pattern S and iteratively 
exploring the search space Af in a depth-first manner. 


First of all, in line 11 the algorithm checks if pattern S is testable at the current level S, i.e. if 


}fci) — requires extending Tarone’s method to the CMH test, which amounts to 

finding 


K 


h=i) = 


K 


max 


{ ^ 5 . 


rcrh({«5 h=l) 


'I^({ 4 h=l) = 1 - (^cmh ({ itl)) 


(3) 

(4) 


where we have omitted the dependence on { n\ and { n® to simplify the notation and 
F ^2 (x) denotes the distribution function of a Xi random variable. In the Appendix we prove the 
following proposition; 

Proposition 1. 


max 


K 




(e- 


K 
= 1 ' 


‘5,min ~ 1 ^Sj > 




K 


'^S^max-r^S^ 


(i- 


(5) 


The result in proposition 1 allows us to compute 'k({ }^i) exactly in 0{K) time. If the pattern 

is testable at the current ^vel 5, the block of code between lines 12 and[l7]is executed. Firstly, 
the number of testable patterns processed so far, \It{5)\, and the corresponding entry of the ar¬ 
ray buckets are increased to account for the newly processed testable pattern. Then, the FWER 
condition <5 \It{^)\ < a is checked. If not satisfied, then we must decrease the tentative corrected 
significance threshold S until the condition is satisfied again. Notice that every time S is decreased 

to the next point in the grid, the patterns S for which 10“< 'k({ x^ < 10 “^'® become 
non-testable. Therefore, |It(< 5)| is adjusted by subtracting buckets[j] in line|l5| 


Search space pruning: Next, one must determine whether children S' of the current pattern S must 
be visited or whether we can prune the search space. This step is crucial since, otherwise, we would 
need to explore the entire search space Ai, resulting in an inadmissible computational complexity. 
Since, by assumption, xs' < xs for all children S' of pattern S, when the function 'I'(x 5 ) is 
monotonically decreasing in xs G [0,min(ni,n — ni)], one can conclude that children of non- 
testable patterns with xs <, min(ni, n — rii) must be necessarily non-testable as well. This is the 
key property that has been exploited by all existing previous work in significant pattern mining based 
on Tarone’s method. Nonetheless, the function ^'({xg }f^i) does not exhibit this monotonicity 
property; 

Proposition 2. The minimum attainable p-value function for the CMH statistic, '!'({ Xg }^i). 
not monotonically decreasing with respect to each Xg in Xg G [0, min(nj, n® — n\)] if there exist 
in 1,K such that ^ < 4, ^ > 4 

’J ’ ’ n* 2 ’ n-? 2 


The proof of Proposition]^ can be found in the Appendix. In the light of Proposition]^ the case 
of significant pattern mining for meta-analysis appears to be hopeless. However, as we will show 
next, it is possible to generalize the existing Tarone-based significant pattern mining framework 
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beyond the requirements of monotonicity in The key idea is summarised in the following 

proposition: 

Proposition 3. Children S' of a pattern S can be pruned from the search space if and only if 
^prune > ^ where: 

T„= max rrh({z*}ti) (6) 

'f'prune = 1 — F^liT^mrie) (7) 

Proof Since children S' of pattern S satisfy S satisfy xs' < 2 : 5 , it follows immediately from the 
definition of Tpiune that tI/({ x'g, > 'kprune > <5 and, therefore, S' must be non-testable. □ 


Algorithm 2 Fast evaluation of Tj 


prune 


function eval_T_prune({ Xg 

A* ^ (1 - 70(1 - ^ (1 - y)(l - for alH = 1 ,..., fr 


5: 


idx_l t— argsort({ j3l idx_r t— argsort({ fl 


4: Hi[t\ := 




0 Ascending order 




idx-l[-i] 


Hr\i] := 


(E’=o 

7 idx.r[i] 

E‘=o ( 1- 


TJT) 

I -ti r 


max Hi\i] 


max Hr \i] 


6: Return max.{Hf 

7: end function 


Tjr) 

, llr 


Proposition corresponds to the most general formulation of Tarone-based search space pruning 
in significant pattern mining. Nonetheless, it would only be useful if Tpiune could be evaluated 
efficiently. A priori, the problem for the CMH test statistic seems intractable, since it corresponds to 
a discrete optimisation of a non-monotonic function over a set of size 0 ( 11^1 min(ni, n* — n\)). 
However we have devised an 0{K log(iT)) algorithm to evaluate Tp^ne for the CMH test statistic 
exactly: 

Theorem 1. Algorithm^computes the exact solution to the optimisation problem 


max 


rrimax 
^ cmh 


({^ 


}L) 


(8) 


in the region x^ S [0, min(n 2 , rd — n\)] Vi = 1,..., iT /n 0{K log(iT)) time. 


The complexity statement in Theorem [T] follows immediately from the pseudocode since, for large 
K, the runtime is dominated by the sorting operation. The proof of correctness can be found in the 
Appendix. 

Equipped with Algorithm]^ FastCMH prunes the search space efficiently and correctly in linefT^of 
the function process.next with a call to the wrapper function is_not_prunable. The latter 
evaluates Tprune for the current pattern S by using AlgorithmSand return True if children of pattern 
S cannot be pruned for the search space. Whenever the search space space cannot be pruned, the 
block of code between lines and [^iteratively continues exploring the search space A4 using a 
depth-first strategy. The algorithm naturally stops when no more testable patterns remain at a certain 
level 5, at which point the algorithm returns the corrected significance threshold = a/ \It{5)\, 
that is, the desired target EWER a over the final number of testable patterns \Xt{5)\. 


4 Experiments 

This section describes experiments which exhibit the three main contributions of the FastCMH 
algorithm: its improved detection performance (statistical power), its superior speed, and its ability 
to handle categorical data, when compared to approaches that use naive multiple testing correction 
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Figure 1; (a) A comparison of the power of FastCMH, FAIS-x^ Bonferroni-CMH for 

detecting true significant subsequences, as Pcase varies, (b) The proportion of confounded significant 
subsequences falsely detected by each of those three algorithms, (c) A plot showing the runtimes of 

FastCMH, FAIS-x^, FAIS-CMH and Bonferroni CMH. 


procedures and/or do not take categorical information into account. The advantages of the FastCMH 
algorithm become clear in situations where the data can be split into several categories and there are 
a huge number of tests; then using Tarone’s testability criterion can require many computations of 
the maximum possible value of the CMH test statistic. We study the following instance of pattern 
mining: suppose we are given binary xi [j] sequences with class labels yi G {0,1}, and the sequences 
may be split into categories 1, 2,..., K. We try to find subsequences : Tm + fm] such that 
maxXi[Tm '■ TVn +fm], the maximum over all binary variables in this sequence, correlates with class 
membership. 

Simulation study We compare three main methods: (1) FastCMH, our proposed method uses 
which Tarone’s testability and efficiently computes the CMH statistic, (2) FAIS-x^, which uses 
Tarone’s testability, but does not take the categories into account, and so does not use the CMH 
test, (3) Bonf erroni-CMH, which does not use Tarone’s testability, but does use the CMH test. 
In order to investigate the difference in power between the three methods, we contract n binary 
sequences Xi of length L, and with AT classes, there are njK samples in each class. In each class, 
half of the samples have label t/i = 0 (controls) and the other half have label t/i = 1 (cases). Initially 
each element Xi[j] is sampled from a Bernoulli distribution with probability pi of being a 1, i.e. 
Xi[f\ ~ B(l,pi). A significant subsequence starting at Tk with length ik, Xi[Tk : rx + — 1] is 

created by sampling each element from a Bernoulli distribution with parameter 1 — (1 — Pci^seY^ 
if Pi = 0. This ensures that the probability of at least one 1 occurring in the subsequence Xi : 
Tk + ik — 1] is Pease- If the label yi = 0, then the subsequence is left as before (with probability pi 
of each entry being a 1). 

Power There are two complementary situations where FastCMH has improved power: first, it has 
improved detection power of true significant subsequences, due its use of Tarone’s testability cri¬ 
terion, when compared to Bonf erroni-CMH; second, it will often (correctly) omit subsequences 
which appear to be significant, but are actually highly correlated with the categorical labels rather 
than the class labels. Recall that if /? is the probability of a method returning a false negative (i.e. 
the Type II error), then the (statistical) power is defined as 1 — /?. In other words, when attempting to 
detect ^nificant subsequences, it is the probability of correctly detecting a significant subsequence. 
Figure fUa) shows the results of running the FastCMH, FAIS-x^ and Bonf erroni-CMH meth¬ 
ods over the data generated with parameters: n = 500, L — 10, 000, pi = 0.2, K = 2, and one 
significant subsequence of length 5 at location 2500 in each sequence. In this setting, each category 
is generated in the same manner. 

One notices that both FastCMH and FAIS-x^ have higher power than Bonf erroni-CMH for 
Pease C [0.3,0.8]. In fact the results for FastCMH and FAIS-x^ are so similar that they are virtually 
identical. A similar experiment can be performed, by fixing the value of pcase and allowing the 
length of L to vary between 1000 and 100,000, while keeping the other parameters the same as 
before. The results of this experiment are shown in the Supplmentary material, and show that when 
L is very large, the power of the Bonf erroni-CMH method can become zero, while the power for 
FastCMH and FAIS-x^ are non-zero. 


7 



























Table 1: Number of significant patterns found by x^-test and FastCMH 


Datasets 

YEL 

LY 

EES 

X^-test 

1069 

26 

20 

EastCMH 

73 

2 

8 


So far, the sequences for the different categories have been generated in the same manner, which 
leads to little difference between FastCMH and FAIS-x^. In the Supplementary material a pro¬ 
cedure is described for contstructing a confounded significant subsequence, which is a subsequence 
that is highly correlated with the categorical labels, and the categorical labels themselves are corre¬ 
lated with the class labels yi. An experiment, the same as before but now with only these confounded 
significant subsequences, can be conducted. The results, in Figure[TJb), show that FastCMH and 
Bonferroni-CMH do not detect these confounded significant subsequences, but the FAIS-x^ 
method, which does not take categorical labels into account and does not use the CMH test, often 
incorrectly detects these confounded significant subsequences. This illustrates the value of using 
the CMFI test, and in particular the FastCMH method. In the next set of experiments, we show 
that FastCMH is significantly faster and more efficient than the brute-force Bonf erroni-CMH 
method. 

Speed In addition to the three methods described above, we investigate the speed of (4) FAIS-CMH, 
a method which uses Tarone’s testability criterion, but computes CMH in a naive way. Theo¬ 
rem 1 shows that FastCMH is equivalent to this naive approach, so it was not included in the 
power simulations above. Figure [^c) shows that FastCMH is several orders of magnitudes faster 
than BonfeiToni-CMH and FAIS-CMH. In fact, FAIS-CMH, which computes the maximum CMH 
statistic in a naive manner, increases drastically as K increases, while FastCMH increases almost 
linearly. FastCMH is comparable in speed to FAIS-x^, but recall that FAIS-x^ does not take 
the categories K into account, and that Figure [^b) shows that it is susceptible to falsely detect¬ 
ing confounded significant subsequences. Overall, these experiments have shown that our method 
FastCMH is significantly faster and has better detection performance in comparison to competitor 
methods. 

Real Data Application: Arabidopsis We perform significant subsequence search on genetic data 
from Arabidopsis thaliana a, a plant model organism. Each plant is represented by a sequence of 
214,051 binary variables (variants in its DNA sequence) and belongs to one of two classes (pheno¬ 
types). Significant patterns are subsequences of binary sequence variants that are correlated with the 
phenotype of the plant. We focus on three relevant datasets, each describing the presence/absence of 
a particular phenotype (YEL, LY, EES). Covariates represent geographic origin of the plants, here 
represented by four distinct clusters. We compare EastCMH to a x^-test on each dataset and ob¬ 
serve that conditioning on the covariates drastically reduces the number of significant subsequences 
found. Randomly permuting the covariates and repeating the experiment 150 times shows that this 
is a statistically significant reduction in the number of patterns found. The p-values based on this 
null distribution associated to each dataset are (0.007,0.007,0.070), respectively. This indicates that 
EastCMH manages to remove false positives findings that were caused by not considering geo¬ 
graphic origin as a covariate. (See Supplement for a detailed description of the experiment.) 

5 Conclusions 

In this article, we have presented an approach to significant pattern mining in the presence of categor¬ 
ical covariates with K states. Key to our approach is that we define a variant of the Cochran-Mantel- 
Haenszel test for K 2 x 2 contingency tables a, which prunes the vast search space of candidate 
patterns by excluding non-testable patterns.We propose an efficient, exact algorithm to compute this 
test. Its runtime is only linear in K, whereas the standard approach scales exponentially in K. In our 
experiments, it combines computational efficiency with high statistical power on simulated and real- 
world datasets. Our findings open the door to important applications of significant pattern mining, 
for instance, in personalized medicine, which we will explore in future work. 


8 





A Appendix 


This appendix provides proofs of the theoretical results described in the main part of the paper, 
further details about the experimental study, as well as additional experiments. 

Proofs 

Proposition 4. 


max 


K 




(EZ: 


= 1 ^S,min T 




i—l ^S,max ^ 


EZii"i.^-E>s (i- 


(9) 


Proof. The CMH test statistic can be written as: 


Tcmh = 




(as - Ef=i Exh) 


Ef=i 7*(1 - l')x's (l - Ef=i r(l - 7^0:^ (l - 


( 10 ) 


where 05 = Ef=i<^s- Moreover, we have 05 G '\JlZi ,min ^EZi ,max\- Clearly, Tcmh 
is a quadratic function of as with positive semidefinite Hessian over a compact set. Therefore, its 

Vi = 1,...,iV or 

□ 






maximum must be attained in the extremes; that is, either when Co = a\ 


'S,min 


when a^s = a^s.max Vi = l,...,iT. 


, K 


Proposition 5. The minimum attainable p-value function for the CMH statistic, T'({ 

not monotonically decreasing with respect to each Xs in x's G f),m.in(n\,rd — n\)] if there exist 


i,i ini,... ,K such that ^ < 4, ^ 4 

Proof. We just need to show that the case 




max ({ 2 * }.^ J f (xs, Xs,...,Xs) 


( 11 ) 


can occur in practice. That can be readily seen during the proof of Theorem 1, so we defer the proof 
of Proposition 2 to that. □ 


Theorem 2. Algorithm 2 computes the exact solution to the optimisation problem 


, K 


max 


( 12 ) 


in the region x's V [0i min(nj, n® — n\)] = 1,..., K in 0{K log(iT)) time. 

We decompose the proof of Theorem in several parts. First of all, note that when x's C 
[0,min(ni,n® — n\)] Vi = 1,... ,7T, the maximum attainable CMH test statistic can be 

written as x% }Zi) = max(r,({ x^ Tr({ x's }^i)) where: 


K 


mx's},^,) = 


{E^=iEx's 


) 


EZi 1"(^ - l")xs ( 1 - 

(EL(I-EH)' 

Ef=ii"{^-E)xs ( 1 - 

Using that, first we prove the following lemmas: 


Tri{Xs}i^l) = 


(13) 


(14) 
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Lemma 1. The functions Ti{{ z’’}f^^) in the region 0 < z'^ < Xg with < 
min(n5^,n® — n\)\ Vz = 1,... ,K as a function of a single variable z^ while keeping the other 
K — 1 variables fixed satisfies one of the three conditions: (1) it is monotonically increasing in 
0 < < x^; (2) it is monotonically decreasing inQ < z^ x^g or (3) has a single local minimum 

in 0 < z^ < x^g. 

As a consequence, both functions T/({ z* }^i), ^r({ 2 * are maximized with respect to z^ while 

the other variables are kept fixed either when z^ = 0 or when z^ = Xg. 


Proof Computing the partial derivative of Ti{{ z* and Tr{{ z® with respect to z^ yields: 


dm z® }ti) ^ }fmM{ iti)) 


dz^ 


dTr{{ z® ^ }lf))Ar{{ z® 


dzi 


with 


K 






(Eti7®(l-7®)^®(1- $)) 


A.({z®}L)= 


(Ef=i7®(l-7®)^®(1-S))^ 


K 


K 


E 7V 2(l-y)(l--)-(l-7^) +(W) + E | z^ 


K 




m{z"}f=i)= E (1-7')2:* f27®(l - -7^")+7^ I (1-7^) + ^ ^ (l-7®)z®|z' 






Because A/({ z® }^]^) > 0 and Ar({ z® }f-i) > 0, the sign of the partial derivatives are determined 
by the sign of A;({ z® and A^d z® respectively. In both cases, A({ z® can be 

expressed as A({ z® }^i)) = b{{ z® }f^i + p.{{ z® affine function 

of z^ where the intersect and slope is controlled by all other K — 1 variables. Moreover, regardless 
of { Iti i/j’ tH i^j) — f*- Therefore the partial derivatives are either always positive, 

always negative, or negative until a unique point where it crosses zero (minimum) and then positive. 

□ 


Lemma 2. The functions Ti{{ z® }j^i) and Tr{{ z® in the region 0 < z® < Xg, with Xg < 
min(nd n® — n\)\ Vz = 1,..., AT, are maximized when either z® = 0 or z® = x'g for all i = 
1,..., K. In other words, the size of the potential set of optimal values for { z® is 2^. 


Proof The proof is analogous for both T/({ z® }j^i) and Tr{{ z® Therefore, we focus on 

Ti{{ z® without loss of generality. We can write: 

max TAf z® ) = maxmax... maxTi({ z® (15) 

By Lemma [T] 

maxri({ z® = max (T/({ z® , 0), T/({ z® , xf)) (16) 

The same argument can be then applied recursively to the functions 21({ z® , 0) and 

mz^}t 1 i^K i^s) ia the variable z^ ^ and so on until the conclusion of the theorem fol¬ 
lows. □ 
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One can rewrite both function T/({ z* and T^d z* generically as: 

{Ehhia.))' 


( 17 ) 


with ai € [0,1] and li(ai) > 0. In particular, = (1 — Y)(l — ^), li(ai) = nl — prd) 

for r;({z*}^d and = 7d(l - ^), h(ai) = (n* - nl) (^1 - ^1^) for Tr({ }^^). Let us 
introduce K binary indicator variables 5x,...,5k in T: 




,5k) = 


{j2f=i 5Ma,)'^ 


Y)f=i 5iadi{ai) 

The only actual requirements we assume for T((5i,..., 5 k ) as defined above are that with ai G 
and li{ai) > 0. Suppose further that: 


(18) 

[ 0 , 1 ] 


argmax T(5i,... ,5k) 
6i,...,Sk 





R K-R 


(19) 


holds with i? > 0. Informally, equation ( [T^ means that the maximum is achieved by keeping the 
terms in the summation corresponding to the R smallest ai. Since we know that Tdn“({ Xg }^i) = 
max(T/({ Xg }f=i), Tr{{ Xg }^i)), and both T/({ z® }f^i) and Tr{{ z® }f^i) are maximised when 
either z® = 0 or z® = Xg by lemma|^ then it follows that if aj > ai and z^’* = x^g then z®’* = Xg. 
But that is exactly the condition that Algorithm 2 exploits to reduce the computational complexity 
of evaluating Tprune from the value 0{2^) resulting from lemma|^to just 0(iT log(Ar). In other 
words, to finish proving Theorem 1, we just need to show that equation ([T^ holds. 


We will prove it by induction. First, we show that the statement holds for K = 2. 

Lemma 3. Following all the previous definitions, we have that: 

argmax T{5i,52) G {(1,0), (1,1)} (20) 

<5i ,(52 


Proof. The only possible contradicting case would be argmax T{5i, ^ 2 ) = (0,1), since the case 

<51,^2 

(0, 0) yields a trivial value for the function T. We show directly that under the assumption a\ < 0 : 2 , 
the contradiction cannot happen. Indeed we have: 


(^i(q;i) + ^ 2 ( 0 : 2 ))^ _ 

aiZi(ai) + 02 ( 2 ( 0 : 2 ) 02 ( 2 ( 02 ) 


((l(oi) + h{ci2))‘^C^2hio‘2) — ^2(Q^2)(o1^i(oi) + 02^2(02)) 

(Oi(i(ai) + a;2(2(02))02(2(02) 

( 21 ) 

( (a ^ (^l('^l) + 2^2 (o2))o2(2(o2) — 01(2(02) 

(01(1(01) + a2l2 {0:2))0:212 (0:2) 

I (o ^'^2 ^i(0i)( 2(02) + (202 — 0i)(|(02) 

(oi(i(oi) + 02(2(02))02(2(02) 


Since (^(oi) > 0 and oi < 02 , it follows that the numerator in the expression above is positive, thus 
T(l, 1) > r(0, 1 ) contradicting the statement that argmax T(i5i, ^ 2 ) = (0,1). □ 

51,^2 


Now we prove the induction step. Suppose the statement holds for an arbitrary dimension K, we 
will show then it also holds for dimension K + 1. 

Lemma 4. If we have: 


arg max 


(E^i^i(*(ai)) 

Sill 5iaili{ai) 


( 1 , 1 ,..., 1 , 0 , 0 ,..., 0 ) 


K-R 


( 22 ) 
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Then: 


^ih{oii) + 6 K+llK+l{otK+l)^ 

argmax —-= (1,1,..., 1,0, 0,..., 0) (23) 

5 i,...,Sk,Sk+i ^K+l<^K+llK+l{otK+l) '' -^^ '- 


Proof. We can start by writing; 

(SiLi ^ih{oii) + Sk+iIk+i{o:k+i)'^ 


max 


Siaikiai) + 6 K+iaK+ilK+iiaK+i) 

(j 2 f=i (j 2 ^=i ^ih{ai) + lK+i{aK+i)j 


(24) 

2 


= max I max ^ , max ^ 

5i,...,Sk J2i=l^i0^ih{az) + (^K+llK+l{aK+l) 


If; 


max 




> max 




(25) 


J.XXCl'^V ^ xxxcx^^ 

—q SiCXili(cXi) -\- CT (jT-t-i (q ) 

Then the statement is trivially true. Suppose now that equation ( |25] l does not hold. We show next 
that; 

{^f=l ^ih{oti) + Ik+i{o!k+i)^ 


(<5i, ■■■,Sk) = argmax 


= (1,!,...,!) (26) 


5i,....S k + ax+llR+lioiK+l) '-' 


which would complete the proof. To show that equation ( |2^ is true when equation ( [25] ) does not 
hold, we proceed by contradiction in two steps. First we prove that there is at most a single j G 
{1,... ,K} I = 0. To see that, suppose 3j | j j = 0 and define; 

{Yh= 1 ^rh{on) + 6K+llK+l{aK+l)) 

r(5i,..., ^,-1,5,+i,..., 5;^, 5k+i) = - ’ , , , —-- , (27) 

+ OK+lOtK+l^K+l{O^K+l) 

which is nothing but T{ 6 i,S 2 ,.. ■ ,Sk, 5k+i) with the j-th term removed. Note that, since 5j = 0, 
we have; 

~ + lK+i{.aK+i)\ 

T(5i,... ,5j-\,5j+i,... ,5 kt5k+i) > - 


max 


(5i ,5j+i ,...,(5ic,(5ic+i 


Si,...,Sk Y^f^i Siaili{ai) + ax+ilK+iioiK+i) 

(28) 


Moreover, since equation ( |25] l does not hold, it follows that; 

(^ 1 , • ■ •, ^j+i) • • •) Sk, Sk+i) = argmax T{Si,..., Sj-i,Sj+i,..., 6 k, 6k+i) 

(29) 

satisfies 5k+i = 1 (otherwise equation ( |25| l would be true). But, since the monotonicity property is 
assumed to be tme for problems of dimension K, it turns out that 5i = 1 for i = 1,..., j — 1, j + 

1,..., iT as well. And, since Sr+i = 1, then; 

~ {^f=i6ili{ai) + lK+i{otK+i)\ 

T{6i,... ,5j-i,6jj^i ,..., Sr , — max ^ , / \ , , “ 

1 di,...,dK + aR+ilR+i{aR+i) 

(30) 

which in turn implies 6i = 1 for i = 1, ... ,j — l,j + 1, ..., K. Thus, j is the only dimension which 
could satisfy 6j = 0. 


max 
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To end the proof, we need to show that, indeed, it’s not possible to have Sj = 0 either. To do so we 
will show that the statement of monotonicity holds for K = 3, then we could easily show Sj = 1. 
We use a change of variables to make it clearer. 


Indeed, we rewrite T{ 6 i,S 2 , Sk-i,Sk) the following way ; 


T(i5i, ^2,..., Sk-i,Sk) = 


(/o + + /kY 

O^ofo + + cukJk 


with 


and 


/o = ^ fi 


/otto 


fioii ^ ao = 




Then, if equation (|25]l does not hold, we would have; 


max T{5i,...,5k,5k+i) 
<5i,---,ok,ok+i 


max 

5o,5j,6K+i 


{SqIo + Sjlj + Sk+iIk+iY 
Soaolo + + ^K+ictR+ilx+i 


(31) 


where we know, by assumption, that the optimum in the right hand side is achieved when (5o = 1 and 
5k+i = 1- If we knew monotonicity holds for K=3, it would then follow that Sj = 1 if aj > a^. 


□ 


To rephrase it, we want to show that the two following cases; T[5j = l,5o = 1,6k = 
1) < T{6j = 0,^0 = 1,Sk = 1) with aj < ao and T{Sj = l,(5o = 1,6k = 
1) < T{Sj = 0,(5o = 1)5k = 1) with Uj > are impossible with the hypothesis that 
V {^1, (52, 6k-i\ T{6i,S2, ■■■, (5_fs-_i, 0) < ■■■, 6k-i, 1) 

First, we show that when aj < ao, then T{Sj = 1, = 1, <5^ = 1) > T{Sj = 0, = 1, 5k = !)■ 

Indeed, after developing the difference we obtain ; 


T{6j = l,6o = 1, Sk = 1)- T{S, =11,6o = 1, 6k = 1) 

= , -wt— - 7 - 7 - \(^o(^^0 ~ Oij) 

[I'OOlo + iKOLKjKijaj + Loao + Lkoik) 

+ l^{2aK ~ ai) + )o^if(2Q!o + 2aK ~ aj + loljao + iK^jaK)) 

As aj < ao < aK, all the terms of the previous sum are positive, which implies that T{Sj = 

l,So = 1,6 k = 1) > T{Sj = 0,6o = 1 ,Sk = 1). 

In a second time we want to show that the case T{6o = l,Sj = 1,Sk = 1) < T(6o = l,6j = 
0, 6 k = 1) with aj > ag is not possible either. In this case we use a Reductio ad absurdum; we are 
going to show that we can not have both T(6o = 1, 6j = 0,, 1) > T(l, 1,1) and T(6o = 1, 6j = 
0,1) > T(6o = 0, 6j = 1,0). Indeed after developing both inequalities, we hnd 

T(^o = 1,<5, =0,1) >r(l,l,l)4^ao < (35) 

lo ^(lo -r IK) + 

T{So = 1,6, = 0, 1) > T{6o = 1,6j =0,0)^ao> , aK (36) 

Zlg + Ik 


(32) 

(33) 

> 0 (34) 
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The first inequality of 35 can be simplified the following way, by using the following inequalities 

ao < aj < uk and Vf k > 0. 


ao < 


< 


1 . (^0 + IkY 

Iq 2(^0 + Ik) + Ij 
1 / (^0 + Ik )^ 

Iq 2{Iq + Ik) + Ij 


ukIk) 


ukIk) 


1 (^0 + Ik )^ 

Iq 2{Iq + Ik) + Ij 


Using[^and|^we have the following result: 


(37) 

(38) 


Iq ^ ^ 1 ^ (Zo + Ik)^ 

21q + ^ Iq ^ 2{Iq + Ik) + Ij 

Iq 1 / (^0 + Ik)'^ , X 

21q + Ik Iq ^{Iq + Ik) + k 
^ 0 < —{Iq + Ik)^{Ik + Ij) 


Ik) 


The last line of the previous equation set shows clearly the contradiction. 
Those two results l^andj^end the proof. 


(39) 

(40) 

(41) 


Real Data Application: Arabidopsis 

We evaluate the ability of FastCMH to detect significant contiguous patterns while correcting for 
confounders from an association mapping study for Arabidopsis thaliana, a small flowering plant. 
Significant contiguous patterns are groups of SNPs (Single Nucleotide Polymorphisms, i.e. muta¬ 
tions occuring on the DNA sequence) significantly correlated with traits or phenotypes (as the size of 
the plant, its resistance to bacterias, the shape of the leaves...). For Arabidopsis data it is well known 
that geographic location is a major covariate because it is strongly related to population structure. 

Data description 

The data are a widely used Arabidospsis thaliana GWAS dataset by Atwell et al. (2010) from online 
resource eaiyGWAS (Grimm et al., 2012). This dataset is composed of 107 tables, related to a 
continuous or dichotomous phenotype, for at most 194 samples of inbred lines and a total of 214,051 
SNPs. Out of those 107 phenotypes, there are 21 dichotomous phenotypes. A previous study found 
more than 2 significant patterns in 7 out of the 21 tables. For the experiments we took among them 
the three phenotype names YEL, LY and LES for which the population structure, measured by the 
genomic inflation factor Xgi > 2, is the biggest. All of them contain 95 sample lines and 214,051 
SNPs. We clustered the sample lines according to geographical locations, a first cluster containing 
the samples harvested outside Europe (No = 25), and three other European clustered. Eor those, we 
visually clustered the remaining samples using the biggest components of a Principal Component 
Analysis, resulting into a Western (Af^, = 23), a Northern (A^„ = 20) and a Central European 
(No = 27) cluster. 

Experiments and results 

We ran a Chi2-test and EastCMH on each of the dataset. It appears that for those samples with a 
high population structure EastCMH finds less significant patterns than the Chi2-test as it is shown in 
Table 1 in the main text. To confirm this result we ran a second experiment: for each phenotype we 
randomized the geographical labels among the samples, 150 times, and calculated respectively the 
number of significant patterns. The number of patterns found by EastCMH on the original tables is 
each time situated in the lower tale of the distribution of the number of patterns in case of a random 
covariate. It shows that our method is correcting for confounders by decreasing the number of false 
positives. 
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Figure 2: A figure comparing the power of FastCMH, FAIS-x^ and Bonf erroni-CMH as the 
length of the sequences L varies. 


Generating confounded significant intervals 

We now describe a procedure for generating a confounded significant interval, as mentioned in the 
Experiments section in the main text. Consider the covariance matrix 

/I 0 Psig \ 

E = ( 0 1 Peon 1 

\ Psig Peon 1 / 

Consider sampling a single z from the multivariate Bernoulli distribution with mean (0.5, 0.5,0.5) 
and covariance matrix E. This can be done in R using the bindata package, which results in a 
three-vector (x, c, y). 

This z will be three-dimensional; the first component zi is an indicator function, which indicates 
(1) if that sequence Xi will contain a significant subsequence, or not (0). The second component Z 2 
is the categorical label, and the third component z^ will be the class label yi. 

Then, for another variable Z 4 ~ for = 0.1, take Z 5 = XOR( 22 , 24 ) to indicate if that 

subsequence will have a confounded significant subsequence. 


Additional Experiments 

Below are additional experiments showing the comparing the power and speed of the various algo¬ 
rithms described in the Experiments section in the main text. 
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n=500,K=4 



Figure 3: A figure comparing the speed of FastCMH, FAIS-CMH, FAIS-x^ and 
Bonf erroni-CMH as the length of the sequences L varies, for K = A 



Figure 4: A figure comparing the speed of FastCMH, FAIS-CMH, FAIS-x^ and 
Bonf erroni-CMH as the number of samples n varies, for K = A 
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Figure 5: A figure comparing the speed of FastCMH, FAIS-CMH, FAIS-x^ and 
Bonferroni-CMH as the length of the sequences L varies, for K = 8. Notice the increased 
difference between FAIS-CMH and FastCMH, when compared to Figure]^ 



Figure 6: A figure comparing the speed of FastCMH, FAIS-CMH, FAIS-x^ and 
Bonf erroni-CMH as the number of samples n varies, for K = 8. Notice the increased difference 
between FAIS-CMH and FastCMH, when compared to Figure]^ 
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